{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "5984f9d7-0d49-40ff-8b41-730b386efd02",
   "metadata": {},
   "source": [
    "# Not a scaling law!\n",
    "\n",
    "We will play with the [Transformers](https://huggingface.co/docs/transformers/en/index) and [Datasets](https://huggingface.co/docs/datasets/en/index) libraries from Hugging Face.\n",
    "\n",
    "The question is: given a fixed compute budget, what is the impact of data scarcity?\n",
    "\n",
    "To simplify, we will add some constraints:\n",
    "- the number of parameters is fixed: you will fine-tune the DistilGPT2 model.\n",
    "- the sequence length is fixed at 64 tokens, and the compute budget allows at most 100 forward and backward passes of a single sequence through the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9f56083c-0297-45de-b339-dd24df4c6360",
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "import datasets\n",
    "from transformers import GPT2TokenizerFast, GPT2LMHeadModel, Trainer, TrainingArguments, DataCollatorForLanguageModeling"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d456c627-ab61-43aa-8ecc-b9c2c3e2e333",
   "metadata": {},
   "source": [
    "The code below loads the tokenizer, the data collator, and the WikiText-2 corpus."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1c695c5d-caa4-4c2a-831c-05e30eaad2c4",
   "metadata": {},
   "outputs": [],
   "source": [
    "t = GPT2TokenizerFast.from_pretrained('distilgpt2')\n",
    "t.pad_token = t.eos_token\n",
    "dc = DataCollatorForLanguageModeling(tokenizer=t, mlm=False)\n",
    "d0 = datasets.load_dataset(\"wikitext\",\"wikitext-2-v1\")\n",
    "dval = d0['validation']\n",
    "dtrain = d0['train']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6d27e538-723f-4b48-8470-57c82503eb09",
   "metadata": {},
   "source": [
    "The code below tokenizes the corpus and builds a training dataset and a validation dataset, keeping only chunks of exactly 64 tokens."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ddf67dd9-5484-4406-a471-9dbb04b65342",
   "metadata": {},
   "outputs": [],
   "source": [
    "slen = 64  # fixed sequence length, in tokens\n",
    "def tokenize(element):\n",
    "    # Split each text into chunks of at most slen tokens.\n",
    "    outputs = t(element[\"text\"], truncation=True, max_length=slen, return_overflowing_tokens=True, return_length=True)\n",
    "    input_batch = []\n",
    "    for length, input_ids in zip(outputs[\"length\"], outputs[\"input_ids\"]):\n",
    "        # Keep only full-length chunks, so every example has exactly slen tokens.\n",
    "        if length == slen:\n",
    "            input_batch.append(input_ids)\n",
    "    return {\"input_ids\": input_batch}\n",
    "dtrain = dtrain.map(tokenize, batched=True, remove_columns=dtrain.column_names)\n",
    "dval = dval.map(tokenize, batched=True, remove_columns=dval.column_names)\n",
    "print(\"training data\",dtrain)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bd039631-619d-4288-849e-31f6de2514c5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Keep only the first 10 validation sequences.\n",
    "dval = dval.select(range(10))\n",
    "print(\"validation data\",dval)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "78b6608f-577d-4edc-9681-d32df1e13009",
   "metadata": {},
   "source": [
    "Here is an example of fine-tuning code that uses the [`Trainer` class](https://huggingface.co/docs/transformers/en/main_classes/trainer)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5af95291-eeb2-4253-a3eb-b950c526ba6d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Train on the first 3 sequences only.\n",
    "d = dtrain.select(range(3))\n",
    "model = GPT2LMHeadModel.from_pretrained('distilgpt2')\n",
    "trargs = TrainingArguments(\".\", do_train=True, num_train_epochs=5, per_device_train_batch_size=1, logging_steps=1, learning_rate=0.0001,\n",
    "            per_device_eval_batch_size=1, eval_strategy=\"steps\", eval_steps=1, report_to=\"none\")\n",
    "tr = Trainer(model=model, args=trargs, train_dataset=d, eval_dataset=dval, processing_class=t, data_collator=dc)\n",
    "tr.train()"
   ]
  },
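  {
   "cell_type": "markdown",
   "id": "budget-count-note",
   "metadata": {},
   "source": [
    "A minimal sketch of how the compute budget can be counted, assuming that one pass means one 64-token sequence going forward and backward through the model once, so that a run over `n_sequences` examples for `num_epochs` epochs costs `n_sequences * num_epochs` passes. Evaluation passes are assumed not to count against the budget. The names `BUDGET` and `passes_used` are illustrative helpers, not part of the Transformers API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "budget-count-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Assumption: the budget counts single-sequence forward/backward passes during training.\n",
    "BUDGET = 100\n",
    "\n",
    "def passes_used(n_sequences, num_epochs):\n",
    "    # Hypothetical helper: each epoch passes every training sequence through the model once.\n",
    "    return n_sequences * num_epochs\n",
    "\n",
    "# The Trainer example above uses 3 sequences for 5 epochs.\n",
    "used = passes_used(3, 5)\n",
    "print(f'passes used: {used} / {BUDGET}')\n",
    "assert used <= BUDGET"
   ]
  },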
  {
   "cell_type": "markdown",
   "id": "cee3cc00-be49-4567-9d33-e3c705e805d3",
   "metadata": {},
   "source": [
    "Given the constraints above, what experiment would you run to measure the impact of data scarcity?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "835bf5cb-e5c5-4da4-bed3-1e82ecb5b503",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "chess",
   "language": "python",
   "name": "chess"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
